library(tidyverse)  # loads dplyr, so a separate library(dplyr) call is not needed

data <- read.delim("bodyfat.txt")

# arrange data in ascending order of body fat percentage
data_order <- data %>% arrange(Pct.BF)

# convert Height (inches to metres) and Weight (pounds to kilograms), then create a BMI variable
data1 <- data_order %>%
  mutate(Height = Height / 39.37) %>%
  mutate(Weight = Weight / 2.205) %>%
  mutate(BMI = Weight / Height^2)

# keep only body fat percentages of at least 3% (removes two points)
df <- subset(data1, Pct.BF >= 3)

# remove Density
df <- df[, !names(df) %in% c("Density")]  ## works as expected
Note: Josh mentioned his group used weight and BMI (which indirectly includes height), yet neither model output below includes either variable.
2.1 Backward stepwise selection
# intercept-only model
intercept_only <- lm(Pct.BF ~ 1, data = df)

# model with all predictors (intercept added back in)
model <- lm(Pct.BF ~ ., data = df)

# backward stepwise regression
backward <- step(model, direction = 'backward', scope = formula(model), trace = TRUE)
# intercept-only model
intercept_only <- lm(Pct.BF ~ 1, data = df)

# model with all predictors
model <- lm(Pct.BF ~ ., data = df)

# forward stepwise regression
forward <- step(intercept_only, direction = 'forward', scope = formula(model), trace = TRUE)
The forward AIC model has an R-squared value of 0.738, while the backward model has an R-squared value of 0.742, so the backward AIC model fits the dataset slightly better.
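A minimal sketch of how the two R-squared values can be pulled from the fitted stepwise models. Since bodyfat.txt and the `df` object are not available here, the built-in `mtcars` dataset stands in for the body fat data; the workflow (full model, intercept-only model, `step()` in both directions, then `summary()$r.squared`) is the same.

```r
# Stand-in dataset: mtcars, with mpg playing the role of Pct.BF
full <- lm(mpg ~ ., data = mtcars)   # model with all predictors
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only model

# backward and forward stepwise selection by AIC
back <- step(full, direction = "backward", trace = FALSE)
fwd  <- step(null, direction = "forward",
             scope = formula(full), trace = FALSE)

# extract and compare R-squared of each selected model
r2_back <- summary(back)$r.squared
r2_fwd  <- summary(fwd)$r.squared
round(c(backward = r2_back, forward = r2_fwd), 3)
```

Note that `step()` selects by AIC, not R-squared, so the two criteria can disagree; reporting both, as the text does, makes that explicit.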
We use 10-fold rather than 5-fold cross-validation here because, with 248 observations in the dataset, 10 folds leave more observations in each training set. Under 10-fold cross-validation, the backward AIC model has a slight advantage over the forward AIC model, as shown below.
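The 10-fold comparison can be sketched in base R as follows. The `cv_rmse` helper is a hypothetical name introduced for illustration, and `mtcars` again stands in for the body fat data; the two formulas are placeholders for the backward- and forward-selected models.

```r
set.seed(1)
k <- 10
# randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# mean out-of-fold RMSE for a given model formula
# (response is hard-coded as mpg for this stand-in example)
cv_rmse <- function(form, data, folds, k) {
  errs <- numeric(k)
  for (i in 1:k) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- lm(form, data = train)
    pred  <- predict(fit, newdata = test)
    errs[i] <- sqrt(mean((test$mpg - pred)^2))
  }
  mean(errs)
}

# compare two candidate models by cross-validated error
rmse_a <- cv_rmse(mpg ~ wt + qsec + am, mtcars, folds, k)
rmse_b <- cv_rmse(mpg ~ wt + cyl + hp,  mtcars, folds, k)
round(c(model_a = rmse_a, model_b = rmse_b), 2)
```

Using the same fold assignment for both models keeps the comparison fair: each model is scored on exactly the same held-out observations.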